The health-care sector is one of the most consequential parts of the US economy. It is an important part of people’s lives, supporting their health and well-being, and it employs around 11% of American workers, so a well-functioning health-care sector is a prerequisite for a well-functioning economy. Unfortunately, the US health-care system has substantial problems: the US spends more on health care than any other country, both as a share of GDP and per capita. Historical studies suggest that the US spends around 10–14% of GDP on health care. This is one motivation for us to study the problem, and in this report we use statistics to investigate what may drive these costs.
To make a profit, insurance companies must charge premiums that exceed the amounts paid out to the insured. For this reason, insurance companies invest a lot of time, effort, and money in building models that can accurately predict health-care costs. In that spirit, we will first analyze the factors that influence medical charges, and then build an adequate model and optimize its performance.
Using our data for both simple and multiple linear regression, we examine which variables affect medical costs and which do not. In the Medical Cost Personal dataset, age and bmi are continuous variables, while sex, smoker, and region are categorical. Summary statistics show that no values are missing from the dataset. Working through each variable will help us interpret the model later and see which variables affect medical costs. During this process we may encounter problems with the categorical variables.
This data is taken from Kaggle. It was originally collected by Brett Lantz for his book “Machine Learning with R” and was later made public.
1. age: the age of the insured (recipient).
2. sex: sex of the insured; “male” or “female”.
3. bmi: body mass index, relating weight to height to indicate relatively high or low body weight.
4. children: number of children covered by health insurance / number of dependents.
5. smoker: whether or not the insured smokes.
6. region: the recipient’s residential area in the United States; northeast, southeast, southwest, or northwest.
7. charges: individual medical costs billed by health insurance.
We can see that the variables sex, smoker, children, and region appear to be categorical.
insurance_data <- read.csv("/Users/harshpatel/Desktop/insurance.csv")
str(insurance_data)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
summary(insurance_data)
## age sex bmi children
## Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## smoker region charges
## Length:1338 Length:1338 Min. : 1122
## Class :character Class :character 1st Qu.: 4740
## Mode :character Mode :character Median : 9382
## Mean :13270
## 3rd Qu.:16640
## Max. :63770
sum(is.na(insurance_data))
## [1] 0
We found no missing values, so we can go ahead with the original dataset.
library(ggplot2)
library(gridExtra)
plot.sex <- ggplot(insurance_data, aes(x = sex, y = charges)) +
geom_boxplot()
plot.smoker <- ggplot(insurance_data, aes(x = smoker, y = charges)) +
geom_boxplot()
plot.child <- ggplot(insurance_data, aes(x = as.factor(children), y = charges)) +
geom_boxplot()
plot.region <- ggplot(insurance_data, aes(x = region, y = charges)) +
geom_boxplot()
grid.arrange(plot.sex, plot.smoker, plot.child, plot.region, ncol=2, nrow=2)
The first boxplot (upper left) shows that females and males pay roughly the same charges on average. The second boxplot (upper right) shows that smokers pay much higher charges than non-smokers. People with more children also tend to pay higher charges, while region does not appear to influence charges. In all cases the charges have a skewed distribution.
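To see this skew directly, one option is to compare histograms of charges on the raw and log scales. This is an exploratory sketch only; we keep charges untransformed in the models that follow.

```r
library(ggplot2)
library(gridExtra)
plot.raw <- ggplot(insurance_data, aes(x = charges)) +
  geom_histogram(bins = 30)          # strongly right-skewed on the raw scale
plot.log <- ggplot(insurance_data, aes(x = log(charges))) +
  geom_histogram(bins = 30)          # closer to symmetric after the log transform
grid.arrange(plot.raw, plot.log, ncol = 2)
```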
insurance_data$sex=as.factor(insurance_data$sex)
insurance_data$smoker=as.factor(insurance_data$smoker)
insurance_data$children=as.factor(insurance_data$children)
insurance_data$region=as.factor(insurance_data$region)
str(insurance_data)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: Factor w/ 6 levels "0","1","2","3",..: 1 2 4 1 1 1 2 4 3 1 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
## $ charges : num 16885 1726 4449 21984 3867 ...
source('/Users/harshpatel/Desktop/pairs.r')
pairs(insurance_data[,c(1,3,4,7)],panel=panel.smooth,diag.panel=panel.hist,lower.panel=panel.cor)
We need to convert the intended categorical variables to factors, because some variables are non-numeric and must be encoded for the analysis and for the pair plot. As we understand it, the response variable should not be transformed, but the pair plot suggests that treating some predictors as polynomial terms may yield more significance and higher model accuracy.
For the preliminary model, we fit a simple linear model using all the variables to predict the response variable, charges.
mod1<-lm(charges~age+sex+bmi+children+smoker+region,data=insurance_data)
summary(mod1)
##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker +
## region, data = insurance_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11689.4 -2902.6 -943.7 1492.2 30042.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11927.17 993.66 -12.003 < 2e-16 ***
## age 257.19 11.91 21.587 < 2e-16 ***
## sexmale -128.16 332.83 -0.385 0.700254
## bmi 336.91 28.61 11.775 < 2e-16 ***
## children1 390.98 421.35 0.928 0.353619
## children2 1635.78 466.67 3.505 0.000471 ***
## children3 964.34 548.10 1.759 0.078735 .
## children4 2947.37 1239.16 2.379 0.017524 *
## children5 1116.04 1456.02 0.767 0.443514
## smokeryes 23836.41 414.14 57.557 < 2e-16 ***
## regionnorthwest -380.04 476.56 -0.797 0.425318
## regionsoutheast -1033.14 479.14 -2.156 0.031245 *
## regionsouthwest -952.89 478.15 -1.993 0.046483 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6059 on 1325 degrees of freedom
## Multiple R-squared: 0.7519, Adjusted R-squared: 0.7497
## F-statistic: 334.7 on 12 and 1325 DF, p-value: < 2.2e-16
Looking at the model summary, we find that several terms are not significant: sex, some levels of children, and regionnorthwest. It is surprising that, despite these insignificant terms, we obtain a decent R² and a high F-statistic, so overall the model performs well. However, this might be because some variables are collinear.
library(car)
## Loading required package: carData
# VIF
vif(mod1)
## GVIF Df GVIF^(1/(2*Df))
## age 1.020607 1 1.010251
## sex 1.009334 1 1.004656
## bmi 1.108836 1 1.053013
## children 1.024871 5 1.002460
## smoker 1.018024 1 1.008972
## region 1.109919 3 1.017533
# tolerance
1/vif(mod1)
## GVIF Df GVIF^(1/(2*Df))
## age 0.9798087 1.0000000 0.9898529
## sex 0.9907520 1.0000000 0.9953653
## bmi 0.9018469 1.0000000 0.9496562
## children 0.9757323 0.2000000 0.9975463
## smoker 0.9822954 1.0000000 0.9911081
## region 0.9009668 0.3333333 0.9827690
# mean VIF
mean(vif(mod1))
## [1] 1.354915
A VIF larger than 10 indicates multicollinearity. None of our VIFs exceed 10, which suggests no multicollinearity. A tolerance lower than 0.2 indicates problems; in our case all tolerance values are above 0.2, so there are no problems here either. The mean VIF is also close to 1, so no bias from collinearity.
plot(mod1)
We have plotted four diagnostic graphs. The first shows the residuals against the fitted values. In this plot the points should be scattered randomly around the horizontal zero line, which is clearly not the case: there appear to be three groups, each with a different residual variance.
There is increasing variance across the residuals (heteroscedasticity), and there might also be a non-linear relationship between the outcome and the predictor. This non-linear relationship is also reflected by the second plot (Q-Q plot), since the dots deviate from the dotted line.
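The heteroscedasticity suggested by the residual plot can also be checked formally with a Breusch–Pagan test. This is a sketch and assumes the lmtest package is installed (it is not loaded elsewhere in this report); a small p-value would support non-constant error variance.

```r
library(lmtest)   # assumed installed; provides bptest()
bptest(mod1)      # Breusch-Pagan test: H0 is constant error variance
```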
library(ggplot2)
library(gridExtra)
plot.age <- ggplot(insurance_data, aes(x = age, y = charges)) +
geom_point()
plot.age2 <- ggplot(insurance_data, aes(x = age^2, y = charges)) +
geom_point()
grid.arrange(plot.age, plot.age2, ncol=2)
Looking at the first plot, we can see a slight U shape, while in the second plot (age^2) the trend is roughly linear, so we can use age^2.
library(ggplot2)
library(gridExtra)
plot.bmi <- ggplot(insurance_data, aes(x = bmi, y = charges)) +
geom_point()
plot.bmi2 <- ggplot(insurance_data, aes(x = bmi^2, y = charges)) +
geom_point()
grid.arrange(plot.bmi, plot.bmi2, ncol=2)
Plotting bmi, there is no clear polynomial pattern: the points are spread out, so we will not apply a polynomial term to bmi.
mod2 <- lm(charges ~ smoker + poly(age,2) + bmi + children + region, data = insurance_data)
summary(mod2)
##
## Call:
## lm(formula = charges ~ smoker + poly(age, 2) + bmi + children +
## region, data = insurance_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12630.8 -2869.6 -908.2 1436.3 31028.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2065.86 915.35 -2.257 0.02418 *
## smokeryes 23835.48 410.29 58.094 < 2e-16 ***
## poly(age, 2)1 131863.23 6080.33 21.687 < 2e-16 ***
## poly(age, 2)2 27045.69 6499.70 4.161 3.37e-05 ***
## bmi 331.94 28.42 11.681 < 2e-16 ***
## children1 933.81 438.63 2.129 0.03344 *
## children2 2231.83 485.34 4.599 4.66e-06 ***
## children3 1480.80 558.61 2.651 0.00812 **
## children4 3501.60 1238.42 2.827 0.00476 **
## children5 1782.50 1455.56 1.225 0.22094
## regionnorthwest -404.00 473.54 -0.853 0.39372
## regionsoutheast -1042.73 476.07 -2.190 0.02868 *
## regionsouthwest -950.81 475.08 -2.001 0.04556 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6020 on 1325 degrees of freedom
## Multiple R-squared: 0.7551, Adjusted R-squared: 0.7529
## F-statistic: 340.5 on 12 and 1325 DF, p-value: < 2.2e-16
Quantitative predictors: bmi, age. Categorical predictors: sex, smoker, children, region.
library(car)
scatterplot( charges~bmi | smoker,data=insurance_data) # influence of smoker
scatterplot( charges~bmi | sex ,data=insurance_data) # influence of sex
scatterplot( charges~bmi | region,data=insurance_data) # influence of region
Plotting charges against bmi grouped by each categorical variable, we found that all the categorical variables show an interaction effect, so we will consider interactions of bmi with sex, smoker, and region.
scatterplot( charges~age | sex,data=insurance_data) # influence of sex
scatterplot( charges~age | smoker,data=insurance_data) # influence of smoker
scatterplot( charges~age | region,data=insurance_data) # influence of region
Plotting age against the categorical variables, we found that their effects appear additive with age, so it is better not to use any interaction with age.
mod3=lm(charges~bmi*sex,data=insurance_data)
summary(mod3)
##
## Call:
## lm(formula = charges ~ bmi * sex, data = insurance_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23378 -8115 -3819 4773 48914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3543.81 2362.31 1.500 0.133813
## bmi 297.12 76.27 3.896 0.000103 ***
## sexmale -4349.36 3328.11 -1.307 0.191487
## bmi:sexmale 179.96 106.49 1.690 0.091273 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11860 on 1334 degrees of freedom
## Multiple R-squared: 0.0437, Adjusted R-squared: 0.04155
## F-statistic: 20.32 on 3 and 1334 DF, p-value: 7.029e-13
We can see that in this model sex is still not significant, and neither is the interaction, so it is better not to use this interaction.
mod4=lm(charges~bmi*smoker,data=insurance_data)
summary(mod4)
##
## Call:
## lm(formula = charges ~ bmi * smoker, data = insurance_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19768.0 -4400.7 -869.5 2957.7 31055.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5879.42 976.87 6.019 2.27e-09 ***
## bmi 83.35 31.27 2.666 0.00778 **
## smokeryes -19066.00 2092.03 -9.114 < 2e-16 ***
## bmi:smokeryes 1389.76 66.78 20.810 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6161 on 1334 degrees of freedom
## Multiple R-squared: 0.7418, Adjusted R-squared: 0.7412
## F-statistic: 1277 on 3 and 1334 DF, p-value: < 2.2e-16
We can see that the model with only the bmi–smoker interaction performs well, explaining about 74% of the variance, while the bmi–sex interaction model does not. We will now fit a model using the most significant interaction.
mod5=lm(charges~bmi+bmi:smoker+smoker+poly(age,2)+region+children,data=insurance_data)
summary(mod5)
##
## Call:
## lm(formula = charges ~ bmi + bmi:smoker + smoker + poly(age,
## 2) + region + children, data = insurance_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14933.9 -1721.5 -1242.0 -707.9 31279.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7773.51 813.49 9.556 < 2e-16 ***
## bmi 17.45 25.40 0.687 0.492332
## smokeryes -20173.04 1633.65 -12.348 < 2e-16 ***
## poly(age, 2)1 135664.02 4854.81 27.944 < 2e-16 ***
## poly(age, 2)2 25039.38 5188.06 4.826 1.55e-06 ***
## regionnorthwest -617.73 378.02 -1.634 0.102471
## regionsoutheast -1212.13 380.01 -3.190 0.001457 **
## regionsouthwest -1223.21 379.30 -3.225 0.001291 **
## children1 820.24 350.10 2.343 0.019284 *
## children2 2080.05 387.40 5.369 9.32e-08 ***
## children3 1483.40 445.84 3.327 0.000901 ***
## children4 3917.29 988.53 3.963 7.80e-05 ***
## children5 2429.44 1161.95 2.091 0.036734 *
## bmi:smokeryes 1434.08 52.15 27.497 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4805 on 1324 degrees of freedom
## Multiple R-squared: 0.8441, Adjusted R-squared: 0.8426
## F-statistic: 551.5 on 13 and 1324 DF, p-value: < 2.2e-16
Comparing the interaction model with the regular model, we conclude that the interaction model performs well: its R² of about 0.84 is clearly higher, and almost all variables in the summary have significant p-values, which strengthens our claim.
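As a quick cross-check of this comparison, the two fitted models can also be ranked on information criteria; lower values are better.

```r
AIC(mod1, mod5)   # Akaike information criterion for both models
BIC(mod1, mod5)   # Schwarz / Bayesian information criterion
```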
library(car)
influencePlot(mod5)
## StudRes Hat CookD
## 439 -0.0898323 0.064102618 3.951027e-05
## 517 5.0746727 0.008701754 1.585062e-02
## 1013 4.0151674 0.047387259 5.663598e-02
## 1048 -3.2039227 0.052248865 4.014117e-02
## 1086 0.9727164 0.074225481 5.418884e-03
## 1301 6.6439458 0.008535099 2.628631e-02
The influence plot flags six observations as potentially influential: 439, 517, 1013, 1048, 1086, and 1301. This does not mean we should immediately delete them; rather, we should investigate why they are influential, using the different influence measures we try next. From the table, observations such as 517 and 1301 have large studentized residuals, while 439 and 1086 have high leverage.
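The individual influence measures behind this plot can also be computed directly with base R’s standard diagnostics; a common rule of thumb flags observations whose Cook’s distance exceeds 4/n. This is a sketch of that check.

```r
n <- nrow(insurance_data)
cooks <- cooks.distance(mod5)
which(cooks > 4 / n)                                # candidate influential rows
head(sort(hatvalues(mod5), decreasing = TRUE))      # highest-leverage points
head(sort(abs(rstudent(mod5)), decreasing = TRUE))  # largest studentized residuals
```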
par(mfrow=c(2,2))
plot(mod5,1)
plot(mod5,3)
plot(mod5,5)
plot(mod5,6)
Having identified the influential points, we must decide how to handle them. One option is to drop those rows, which is feasible here because the dataset is large. Outliers can distort the mean and weaken correlations, and removing influential points can improve the fit of the regression equation and the predicted values. So we remove these points and check whether the model accuracy changes.
new_data<-insurance_data[-c(439,517,1013,1048,1086,1301), ] # note the comma: drop rows, not columns
model_woi<-lm(charges~bmi+bmi:smoker+smoker+poly(age,2)+region+children,data=new_data)
summary(model_woi)
##
## Call:
## lm(formula = charges ~ bmi + bmi:smoker + smoker + poly(age,
## 2) + region + children, data = new_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14933.9 -1721.5 -1242.0 -707.9 31279.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7773.51 813.49 9.556 < 2e-16 ***
## bmi 17.45 25.40 0.687 0.492332
## smokeryes -20173.04 1633.65 -12.348 < 2e-16 ***
## poly(age, 2)1 135664.02 4854.81 27.944 < 2e-16 ***
## poly(age, 2)2 25039.38 5188.06 4.826 1.55e-06 ***
## regionnorthwest -617.73 378.02 -1.634 0.102471
## regionsoutheast -1212.13 380.01 -3.190 0.001457 **
## regionsouthwest -1223.21 379.30 -3.225 0.001291 **
## children1 820.24 350.10 2.343 0.019284 *
## children2 2080.05 387.40 5.369 9.32e-08 ***
## children3 1483.40 445.84 3.327 0.000901 ***
## children4 3917.29 988.53 3.963 7.80e-05 ***
## children5 2429.44 1161.95 2.091 0.036734 *
## bmi:smokeryes 1434.08 52.15 27.497 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4805 on 1324 degrees of freedom
## Multiple R-squared: 0.8441, Adjusted R-squared: 0.8426
## F-statistic: 551.5 on 13 and 1324 DF, p-value: < 2.2e-16
We can see that deleting the influential points has essentially no effect on the fitted model.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::combine() masks gridExtra::combine()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::recode() masks car::recode()
## x purrr::some() masks car::some()
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
set.seed(123)
random_sample <- createDataPartition(insurance_data $ charges, p = 0.7, list = FALSE) #Dividing the data in 70:30 ratio
training_dataset <- insurance_data[random_sample, ]
testing_dataset <- insurance_data[-random_sample, ]
mod6 <- lm(charges~bmi+bmi:smoker+smoker+poly(age,2)+region+children,data=training_dataset) # model fit on the training split
predictions<-predict(mod6,testing_dataset)
data.frame( R2 = R2(predictions, testing_dataset $ charges),
RMSE = RMSE(predictions, testing_dataset $ charges),
MAE = MAE(predictions, testing_dataset $ charges))
## R2 RMSE MAE
## 1 0.8260944 4969.245 2913.679
The R² on the held-out test set is about 0.83, slightly lower than for the model fit on all the data, but it is a more honest estimate of out-of-sample accuracy.
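Instead of a single 70:30 split, caret can also estimate out-of-sample performance by k-fold cross-validation. The sketch below uses 10 folds; since fold assignment is random, the numbers will vary slightly from run to run.

```r
set.seed(123)
cv_fit <- train(charges ~ bmi + bmi:smoker + smoker + poly(age, 2) + region + children,
                data = insurance_data,
                method = "lm",                                    # ordinary least squares
                trControl = trainControl(method = "cv", number = 10))
cv_fit$results   # RMSE, Rsquared, and MAE averaged over the 10 folds
```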
We will try to find the best subset of predictors using ols_step_best_subset, a function that fits all possible models with different combinations of variables and reports the corresponding metrics, from which we can determine which variables are significant and which are not. It also lets us compare the different models without hand-coding forward and backward selection procedures. We chose this approach because running the step function produced an error saying the function was masked by another library, so we tried something different.
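For reference, that masking error can usually be avoided by calling step with an explicit namespace; this is a sketch and was not run as part of the analysis here.

```r
step_model <- stats::step(mod6, direction = "both", trace = FALSE) # stepwise AIC selection
summary(step_model)
```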
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_step_best_subset(mod6)
## Best Subsets Regression
## -----------------------------------------------------------------
## Model Index Predictors
## -----------------------------------------------------------------
## 1 bmi:smoker
## 2 poly(age, 2) bmi:smoker
## 3 smoker poly(age, 2) bmi:smoker
## 4 smoker poly(age, 2) children bmi:smoker
## 5 smoker poly(age, 2) region children bmi:smoker
## 6 bmi smoker poly(age, 2) region children bmi:smoker
## -----------------------------------------------------------------
##
## Subsets Regression Summary
## --------------------------------------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## --------------------------------------------------------------------------------------------------------------------------------------------------------------
## 1 0.7396 0.7390 0.7378 685.6017 19060.7565 -177919.3362 19080.1315 36483081884.2513 39019161.3318 41642.9324 0.2615
## 2 0.8286 0.8278 0.8267 134.1045 18672.5141 -433475.9918 18701.5766 24040738858.3882 25766843.8666 27499.6752 0.1725
## 3 0.8456 0.8448 0.8433 30.3199 18576.4929 -537576.9943 18610.3991 21678498109.0562 23259714.7342 24824.1665 0.1557
## 4 0.8504 0.8488 0.8466 2.5702 18556.9750 -572678.2605 18615.1000 21029461104.5978 22684780.0981 24210.8372 0.1512
## 5 0.8514 0.8493 0.8467 -2.0000 18556.3288 -580039.6158 18628.9851 20903412197.1657 22621710.8844 24143.8548 0.1505
## 6 0.8514 0.8493 0.8467 0.0000 18556.3288 -578939.8869 18628.9851 20925888984.4741 22645674.5612 24169.8159 0.1508
## --------------------------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
From this, we conclude that models 5 and 6 work best. Following the KISS principle, we can go with model 5, since it has one fewer variable than model 6 with the same accuracy, making it less complex.
Now we run several selection criteria on the models created above to see which works best for us.
source('/Users/harshpatel/Downloads/modelselfunctions.r')
model1<-Criteria(mod1,FullMSE = 10,label=T)
model2<-Criteria(mod2,FullMSE = 10,label=T)
model3<-Criteria(mod3,FullMSE = 10,label=T)
model4<-Criteria(mod4,FullMSE = 10,label=T)
model5<-Criteria(mod5,FullMSE = 10,label=T)
model6<-Criteria(mod6,FullMSE = 10,label=T)
df<-rbind(model1,model2,model3,model4,model5,model6)
df
## p+1 R2adj Cp AIC PRESS
## model1 13 0.7497 4863906994 23318.92 49673726637
## model2 13 0.7529 4801704879 23301.70 49040471884
## model3 4 0.0416 18750487809 25106.39 188739016540
## model4 4 0.7412 5063202800 23354.65 50976180503
## model5 14 0.8426 3056364877 22699.27 31255560631
## model6 14 0.8493 2076959713 15892.40 21436878626
After computing these metrics, we see that some models fit similarly, but by every criterion model 6, the one with cross-validation, is the best.
By applying interactions, polynomial terms, multiple regression, outlier analysis, and cross-validation, we conclude that the model combining all of these performs well on the data, and the cross-validated model also performs well on the test data, which makes it more trustworthy. The best R² we obtained during our analysis was around 0.84. To further improve accuracy and prediction, one could try machine-learning algorithms such as KNN, random forests, or SVMs, but in the end, regression, the simplest method of all, works quite well.
As mentioned in the introduction, health-care costs in the US are higher than anywhere else in the world, and there is a large literature studying this. Here we look at a paper by David W. Bates in Health Affairs, “Big Data In Healthcare”.
The US medical care system is rapidly adopting electronic health records, which will dramatically increase the amount of clinical data available electronically. At the same time, rapid progress has been made in clinical analytics: techniques for analyzing large quantities of data and gleaning new insights from that analysis, which is part of what is known as big data.
A health insurance company can only make money if it collects more than it spends on the medical care of its beneficiaries. On the other hand, even though some conditions are more prevalent in certain segments of the population, medical costs are difficult to predict, since much of the spending comes from patients’ rare conditions.
So the objective is to accurately predict insurance costs from people’s data, including age, body mass index, smoking status, etc. We will also determine which variable most influences insurance costs. These estimates could be used to create actuarial tables that set yearly premiums higher or lower according to expected treatment costs; this is mainly a regression problem.
https://www.healthaffairs.org/doi/10.1377/hlthaff.2014.0041
https://helioscompany.com/medical-billing-data-analytics/
https://www.brookings.edu/research/the-opportunities-and-challenges-of-data-analytics-in-health-care/
https://www.kaggle.com/sinaasappel/tutorial-multiple-regression/notebook
https://www.kaggle.com/sinaasappel/tutorial-multiple-regression/data